Abstract

Many emerging Web services, such as email, photo 
sharing, and web site archives, need to preserve large 
amounts of quickly-accessible data indefinitely into the 
future. In this paper, we make the case that these applica- 
tions' demands on large scale storage systems over long 
time horizons require us to re-evaluate traditional stor- 
age system designs. We examine threats to long-lived 
data from an end-to-end perspective, taking into account 
not just hardware and software faults but also faults due 
to humans and organizations. We present a simple model 
of long-term storage failures that helps us reason about 
the various strategies for addressing these threats in a 
cost-effective manner. Using this model we show that the 
most important strategies for increasing the reliability of 
long-term storage are detecting latent faults quickly, au- 
tomating fault repair to make it faster and cheaper, and 
increasing the independence of data replicas. 

1 Introduction 

Frequent headlines remind us that bits, even bits stored 
in expensive, professionally administered data centers, 
are vulnerable to loss and damage [18, 21, 30, 38]. De- 
spite this, emerging web services such as e-mail (e.g., 
Gmail), photo sharing (e.g., Snapfish, Ofoto, Shutterfly),¹
and archives (e.g., The Internet Archive) require large
volumes of data to be stored indefinitely into the future. 
The economic viability of these services depends on stor- 
ing the data at very low cost. Customer acceptance of 
these services depends on the data remaining both unal- 
tered and accessible with low latency. 

Satisfying these requirements over long periods of 
time would be easy if fast, cheap, reliable disks were 
available, and if threats to the data were confined to 
the storage subsystem itself. Unfortunately, neither is 
true. The economics of high-volume manufacturing pro-
vide a choice between consumer-grade drives, which are
cheap, fairly fast, and fairly reliable, and enterprise-grade
drives, which are vastly more expensive, much faster, but
only a little more reliable. In § 6.1 we show that an en-
terprise drive fourteen times more expensive than a con-
sumer drive only reduces bit errors from 8 to 6 over a
5-year, 99% idle lifetime. For short-lived data, this level
of reliability might not pose a problem, but for long-lived
data these faults are inescapable.

¹ All trademarks mentioned in this paper belong to their respective
owners.

Further, long-term storage faces many threats beyond 
the storage system itself. These include obsolescence 
of data formats, long-term attacks on the data, and eco- 
nomic and structural volatility of organizations sponsor- 
ing the storage. 

In this paper, we make the case that long-term reli- 
able storage is a problem that deserves a fresh look, given 
its significant differences from traditional storage prob- 
lems addressed in the literature. We start by motivating 
the need for digital preservation — storing immutable 
data over long periods (§ 2). We then list the threats that 
imperil data survival using examples from real systems 
(§ 3), and examine why the design philosophy of many 
current storage systems is insufficient for long-term stor- 
age (§ 4). 

To understand the implications of this problem better, 
we introduce a simple reliability model (§ 5) of replicated 
storage systems designed to address long-term storage 
threats. This model is inspired by the reliability model 
for RAID [42], but our extensions and interpretation of 
the model take a more holistic, end-to-end, rather than 
device-oriented, approach. Our model includes a wider 
range of faults to which the replicas are subject. It ex- 
plicitly incorporates both latent faults, which occur long 
before they are detected, and correlated faults, when one 
fault causes others or when multiple faults result from the 
same error (§ 4). Our model incorporates and highlights 
the importance of a detection process for latent faults. 

Although our model is simplistic, it highlights needed 
areas for gathering reliability data, and it helps evalu- 
ate strategies for improving the reliability of systems for
long-lived data (§ 6). For example, we would like to be
able to answer questions such as, how dangerous are la- 
tent faults over long time periods? (Quite dangerous, 
§ 5.4.) Would it be better to replicate an archive on
tape or on disk? (Disk, § 6.2.) Is it better to increase 
the mean time between visible faults or between latent 
faults? (Perhaps neither if it significantly decreases the 
other, § 5.4.) Is it better to increase replication in the sys- 
tem or increase the independence of existing replicas? 
(Both, but replication without increasing independence 
does not help much, § 5.5.) Some of these strategies have 
been proposed before (§ 7), but we believe they are worth 
revisiting in the context of long-term storage. 

We conclude (§ 8) that the most important strategies 
for increasing reliability of long-term storage are detect- 
ing latent faults quickly, automating the repair of faults 
to be faster and cheaper, and increasing the independence 
of data replicas. This includes increasing geographic, 
administrative, organizational, and third-party indepen- 
dence, as well as the diversity of hardware and software. 
We hope that our analysis and conclusions, while still 
primitive, will motivate a renewed look at the design of 
storage systems that must preserve data for decades in 
volatile and even hostile environments.

2 The Need for Long-term Preservation 

Preserving information for decades or even centuries has 
proved beneficial in many cases. In the 12th century BC 
Shang dynasty Chinese astronomers inscribed eclipse
observations on "oracle bones" (animal bones and tor-
toise shells). About 3200 years later researchers used
these records, together with one from 1302 BC, to esti-
mate that the total clock error that had accumulated was 
just over 7 hours, and from this derived a value for the 
viscosity of the Earth's mantle as it rebounds from the 
weight of the glaciers [41]. 

Longitudinal studies in the field of medical research 
depend upon accurate preservation of detailed patient 
records for decades. In 1948 scientists began to study the 
residents of Framingham, Massachusetts [13] to under- 
stand the large increase in heart disease victims through- 
out the 1930s and 40s. Using data collected over decades 
of research, scientists discovered the major risk factors 
that modern medicine now knows contribute to heart dis-
ease. 

In 1975, the former USSR sent two probes, Venera 9 
and 10, to the surface of Venus to collect data and pho- 
tographs. The resulting photographs were of very low 
quality and were relegated to the dustbin of science his- 
tory. About 28 years later, an American scientist was 
able to use modern digital image processing algorithms 
to enhance the photographs to reveal much more detailed 
images [60]. 

These timescales of many decades, even centuries,
contrast with the typical 5-year lifetime for computing
hardware and similar lifetimes attributed to digital me- 
dia. It is not just scientific data that is expected to per- 
sist over these timescales. Legislation such as Sarbanes- 
Oxley [2] and HIPAA [1] requires many companies and
organizations to keep electronic records over decades. In 
addition, consumers used to analog assets such as mail 
and photographs, which persist over many decades, are 
now happily entrusting their digital versions to online 
services. The associated marketing literature [16] en-
courages them to expect similar longevity. 

3 Threats to Long-term Preservation 

While traditional short-term storage applications must
also anticipate a variety of threats, the list of threats to
long-term storage is longer and more diverse. Some 
threats, such as media and software obsolescence, are 
particular to long-term preservation. Other threats, such 
as natural disaster and human error, are also threats to 
short-term storage applications, but the probability of 
their occurrence is higher over the longer desired life- 
times for archival data. In this section we list threats
to long-term storage and provide real examples to mo- 
tivate them. We take an end-to-end approach in identi- 
fying failures, concentrating not just on storage device 
faults but also on faults in the environment, processes, 
and support surrounding the storage systems.

Large-scale disaster. Large-scale disasters, such as
floods, fires, earthquakes, and acts of war must be an- 
ticipated over the long desired lifetimes of archival data. 
Such disasters will typically be manifested by other types 
of threat, such as media, hardware, and organizational 
faults, and sometimes all of these, as was the case with 
many of the data centers affected by the 9/11 attack on 
the World Trade Center. 

Human error. One of the ways in which data are lost 
is through users or operators accidentally deleting (or 
marking as overwriteable) content they still need, or ac- 
cidentally or purposefully deleting data for which they 
later discover a need. For instance, rumors suggest 
that large professionally administered digital repositories 
have suffered administrative errors that caused data loss 
across replicas, but the organizations hosting these repos-
itories are unwilling to report the mishaps publicly (see 
§ 8). Sometimes the errors instead affect hardware (e.g., 
tapes being lost in transit [46]), software (e.g., unin-
stalling a required driver), or infrastructure (e.g., turning
off the air-conditioning system in the server room) on 
which the preservation application runs. Human error is 
increasingly the cause of system failures [40, 45].

Component faults. Taking an end-to-end view of a sys-
tem, all components can be expected to fail, including 
hardware, software, network interfaces, and even third- 
party services. Hardware components suffer transient re-
coverable faults (e.g., caused by temporary power loss)
and catastrophic irrecoverable faults (e.g., a power surge 
fries a controller card). Software components, includ- 
ing firmware in disks, suffer from bugs that pose a risk 
to the stored data. Systems cannot assume that the net- 
work transfers they use to ingest or disseminate content 
will terminate within a specified time period, or will ac- 
tually deliver the content unaltered. (The initial ingestion
of large collections into a repository is thus itself error- 
prone.) Third-party components that cannot easily them- 
selves be preserved are also sources of problems. Exter- 
nal license servers or the companies that run them might 
no longer exist decades after an application and its data 
are archived. Domain names will vanish or be reassigned 
if the registrant fails to pay the registrar, and a persistent 
URL [39] will not resolve if the resolver service fails to 
preserve its data with as much care as the storage system 
client. 

Media faults. The storage medium is a component of 
particular interest. No affordable digital storage medium
is completely reliable over long periods of time, since
the medium can degrade; degradation and other such 
medium errors can cause irrecoverable bit faults, often 
called bit rot. Storage media are also subject to sudden 
irrecoverable loss of bulk data from, for instance, disk 
crashes [53]. 

Bit rot is particularly troublesome, because it occurs 
without warning and might not be detected until it is too 
late to make repairs. The most familiar example might be 
CD-ROMs. Studies of CD-ROMs [20, 31] indicate that 
despite being sold as reliable for decades, or even 75 to 
100 years, they are often only good for two to five years, 
even when stored in accordance with the manufacturer's 
recommendations. 

Disks are subject to similar problems. For instance, a 
previously readable sector can become unreadable. Or a 
sector might be readable but contain the wrong informa- 
tion due to a previously misplaced sector write, often due 
to problems with vibrations [4].

Media/hardware obsolescence. Over time, media and 
hardware components can become obsolete (no longer
able to communicate with other system components)
or irreplaceable after a fault. This problem is particu-
larly acute for removable media, which have a long his- 
tory of remaining theoretically readable if only a suit- 
able reader could be found [25]. Examples include 9- 
track tape and 12-inch video laser discs. Evolution of 
the industry specification for PCs has made it difficult 
to purchase a commodity PC with built-in floppy drive, 
indicating that even floppy disks, once considered an ob- 
vious ubiquitous cheap storage medium, are endangered.

Software/format obsolescence. Similarly, software
components become obsolete. This is often manifested
as format obsolescence: the bits in which the data were
encoded remain accessible, but the information can no
longer be correctly interpreted. Proprietary formats, even 
those in widespread use, are very vulnerable. For in- 
stance, digital camera companies have their own propri- 
etary "RAW" formats for recording the raw data from 
their cameras. These formats are often undocumented, 
and when a company ceases to exist or to support its for- 
mat, photographers can lose vast amounts of data [54].

Loss of context. Metadata, or more generally "context,"
includes information about layout, location, and inter- 
relationships among stored objects, as well as the subject 
and provenance of content, and the processes, algorithms 
and software needed to manipulate them. Preserving 
contextual metadata is as important as preserving the ac- 
tual data, and it can even be hard to recognize all required 
context in time to collect it. Encrypted information is a 
particularly challenging example, since preservation of 
the decryption keys is essential alongside preservation of 
the encrypted data. Unfortunately, over long periods of 
time, secrets (such as decryption keys) tend to get lost, 
to leak, or to get broken [14]. This problem is less of 
a threat to short-term storage applications, where the as-
sets do not live long enough for the context to be lost or
the information to become uninterpretable.

Attack. Traditional repositories are subject to long-term
malicious attack, and there is no reason to expect their 
digital equivalents to be exempt. Attacks include de- 
struction, censorship, modification and theft of repos- 
itories' contents, and disruption of their services [55]. 
The attacks can be short-term or long-term, legal or il- 
legal, internal or external. They can be motivated by 
ideological, political, financial or legal factors, by brag- 
ging rights or by employee dissatisfaction. Attacks are a 
threat to short-term storage as well, but researchers usu- 
ally focus on short-term, intense attacks, rather than the 
slowly subversive attacks that afflict long-term reposito- 
ries. Because much abuse of computer systems involves 
insiders [24], a digital preservation system must antici- 
pate attack even if it is completely isolated from external 
networks. Examples of attacks include cases of "saniti- 
zation" of US government websites to conform with the 
administration's world view [19, 35].

Organizational faults. A system view of long-term stor-
age must include not merely the technology but also the 
organization in which it is embedded. These organiza- 
tions can die out, perhaps through bankruptcy, or change 
missions. This can deprive the storage technology of the 
support it needs to survive. System planning must en- 
visage the possibility of the asset represented by the pre-
served content being transferred to a successor organiza- 
tion, or otherwise being properly disposed of with a data 
"exit strategy." Storage services can also make mistakes, 
and assets dependent on a single service can be lost. 
As an example, during an organizational change, a
large IT company closed a research lab and requested
the lab's research projects be copied to tape and sent to 
another of its labs. Unfortunately, few knew about these 
tapes and they were allowed to languish without docu- 
mentation of their contents. When it became clear that 
some of the project data would be useful to current re- 
searchers, enough time had passed that nobody could 
identify what would be on which tape, and the volume 
of data was too huge to reconstruct an index [28]. 

As another example, a user lost all of her digital photos 
stored at Ofoto when she did not make a purchase within 
the required interval and did not receive their email warn- 
ings due to having failed to send them her updated email 
address [29]. Further, for many of these services there 
is no "data exit strategy" for easy bulk retrieval of the 
original high-resolution format of all of the customer's
photos. 

Economic faults. Many organizations with materials to 
preserve do not have large budgets to apply to the prob- 
lem, and declare success after just managing to get a col- 
lection put online. Unfortunately, this provides no plan 
for maintaining a collection's accessibility or quality in
the future. There are ongoing costs for power, cooling,
bandwidth, system administration, equipment space, do- 
main registration, renewal of equipment, and so on. In- 
formation in digital form is much more vulnerable to 
interruptions in the money supply than information on 
paper, and budgets for digital preservation must be ex- 
pected to vary up and down, possibly even to zero, over 
time. These budget issues affect our ability to preserve 
as many collections as desired: many libraries now sub- 
scribe to fewer serials and monographs [6]. 

Motivating an investment in preservation can be diffi- 
cult [17] without better tools to predict long-term costs, 
especially if the target audience for the preserved infor- 
mation does not exist at the time decisions are made. 
While budget is an issue in the purchase of any storage 
system, it is usually easier to plan how money will be 
spent over a shorter lifespan.

4 Dangerous Assumptions 

Some of the threats above are well-understood and we 
should have succeeded in solving these problems by now. 
So why do we still lose data? Part of the reason is that our
system designs often fail to take an end-to-end perspec- 
tive, and that, in practice, we often make a few key, but 
potentially dangerous assumptions. These include vis- 
ibility of faults (§ 4.1), independence of faults (§ 4.2), 
and enough money to apply exotic solutions (§ 4.3), as 
described below. 

4.1 The fault visibility assumption 

While many faults are detected at the time an error 
causes them, some occur silently. These are called "la-
tent faults." There are many sources of latent faults, but
media errors are the best known. While a head crash 
would be detectable, bit rot might not be uncovered until 
the affected faulty data are actually accessed and audited. 
As another example, a sector on a disk might become un- 
readable; this would not be detected until the next read of 
that sector. Further, a sector might be readable but con- 
tain incorrect information due to a previous misplaced 
sector write. Silent media errors and faults occur more 
frequently than many of us have assumed; for instance 
Schwarz et al. suggest that silent block faults occur five 
times as often as whole disk faults [50]. 

In aggregate, archives such as the Internet Archive 
might supply users with data items at a high rate, but 
the average data item is accessed infrequently. Detecting 
loss or corruption only when a user requests access ren- 
ders the average data item vulnerable to an accumulation 
of latent faults [23, 34, 50]. 

While the need to guard against latent faults has long 
been recognized in larger systems [8, 23, 50], increases
in storage capacity have recently brought more attention 
to the problem for commodity storage. An important ex- 
ample is the IRON File System [44], which uses redun- 
dancy within a single disk to address latent faults within 
file system metadata structures. 

Beyond media faults, there are many types of latent 
faults caused by threats in § 3: 

• Human error: Accidental deletion or overwriting of
materials might not be discovered until those mate- 
rials are needed. 

• Component failure: The reliance on a failed sys-
tem component or a third-party component that is no 
longer available might not be discovered until data 
depending on that component are accessed. 

• Media/hardware obsolescence: Failure of an obso- 
lete, seldom-used media reader might not be dis- 
covered until information on the associated medium 
needs to be read. It might then be impossible or too 
costly to purchase another reader. 

• Software/format obsolescence: Upon accessing old 
information we might discover it is in a format be- 
longing to an application we can no longer run. 

• Loss of context: We might not discover we are miss- 
ing crucial metadata about saved data until we try to 
make sense of the data. For instance, we might not 
have preserved an encryption key. 

• Attack: Results of a successful censorship or cor- 
ruption attack on a data repository might never be 
discovered, or might only become apparent upon ac- 
cessing the data long after the attack. 

The general solution to latent faults is to detect them 
as quickly as possible, as indicated by our model in § 6. 
For instance, "scrubbing" [23, 33, 50] can be used to 
detect media faults or evidence of data corruption from
an attack. If preserving the data is important, scrubbing
should be performed while the data can still be repaired 
from a replica (or error correction codes). At a higher 
layer, we can prevent latent faults in the making by de- 
tecting the need to migrate content from old to new me- 
dia or from old to new data formats before we have lost 
the ability to read the old medium or interpret the old 
data format. Another example is detecting the need to 
re-encrypt old materials with new keys before old keys 
are considered obsolete. 
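
As a rough illustration of scrubbing while repair is still possible, the sketch below audits every replica of one object by digest and overwrites any replica that disagrees with the majority. The function names, the use of SHA-256, and the majority-vote repair rule are all assumptions made for this example; the text only requires that a damaged copy be repaired from a good replica or from error correction codes.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest used to compare replica contents."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas: dict[str, bytes]) -> dict[str, bytes]:
    """Audit all replicas of one object and repair the ones that disagree.

    Assumption for this sketch: the content held by a strict majority of
    replicas is taken to be correct. Raises ValueError when there is no
    strict majority, because automatic repair would then be a guess.
    """
    digests = {name: digest(data) for name, data in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no majority among replicas; cannot repair automatically")
    # Pick any replica whose digest matches the majority as the repair source.
    good = next(data for name, data in replicas.items() if digests[name] == winner)
    return {
        name: (data if digests[name] == winner else good)
        for name, data in replicas.items()
    }


if __name__ == "__main__":
    # Hypothetical replica names; site_c has suffered a latent fault.
    copies = {"site_a": b"archive", "site_b": b"archive", "site_c": b"archXve"}
    print(scrub(copies))
```

Note that with only two replicas a disagreement cannot be resolved by voting, so a real system would need an independent checksum or a third copy to decide which side is damaged.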

4.2 The independence assumption 

Many researchers and designers correctly point to data 
replication as a strategy for preventing data loss, but 
make the assumption that replicas fail independently. 
Alas, in practice, faults are not as independent as we 
might hope. Talagala [53] logged every fault in a large 
disk farm (of 368 disk drives) at UC Berkeley over 
six months and showed significant correlations between 
them. (For example, if power units are shared across 
drives, a single power outage can affect a large number of 
drives.) A single power outage accounted for 22% of all 
machine restarts. Temperature and vibrations of tightly 
packed devices in machine room racks [4, 32] are also 
sources of correlated media failures. 

Taking an end-to-end perspective, there are many 
sources of fault correlation corresponding to the threats 
in § 3:

• Large-scale disaster: A single large disaster might 
destroy all replicas of the data. Geographic repli-
cation clearly helps, but care must be taken to en-
sure it provides sufficient independence. For exam- 
ple, in the 2001 9/11 disaster in New York City, a 
data center was destroyed. The system failed over to 
a replicated data center on the other side of a river, 
and the failover worked correctly. Unfortunately, the 
sites were still not sufficiently distant; the chaos in 
the streets prevented staff from getting to the backup 
data center. Eventually it was unable to continue 
unattended [56]. 

• Human error: System administrators are human and
fallible. Unfortunately, in most systems they are also 
very powerful, able to destroy or modify data with- 
out restriction. If all replicas are under unified ad- 
ministrative control, a single human error can cause 
faults at all of them. 

• Component faults: If all replicas of the information 
are dependent on the same external component, for 
instance a license server, the loss of that component 
causes correlated faults at every replica. 

• Loss of context: Losing metadata associated with 
archival data might cause correlated faults across 
replicas. For instance, if materials are encrypted 
with the same key at all replicas, loss of that key will 



make all the rephcas useless. 

• Attack: Attacks can cause correlated faults. For in- 
stance, a flash worm might affect many replicas at
once. 

• Organizational faults: Long-term storage must an- 
ticipate the failure of any one organization or ser- 
vice. Increasing the visibility of digital assets and
developing simple exit strategies for data is impor- 
tant, but so is minimizing dependence on any single 
organization. 

As indicated in § 6 by our model, the best way to avoid 
correlated faults is independence of the replicas. Simply 
increasing the replication is not enough if we do not also 
ensure the independence of the replicas geographically, 
administratively, and otherwise. 

4.3 The unlimited budget assumption

The biggest threats to digital preservation are economic 
faults. With a lot of money we could apply many so- 
lutions to preserving data, but most of the information 
people would like to see live forever is not in the hands 
of organizations with unlimited budgets. Solutions such
as synchronous mirroring of RAIDs across widely dis-
persed geographic replicas might not be affordable over
the long term for many organizations. Although we make
a qualitative attempt to compare the costs of some of the
strategies and solutions we explore in this paper, estima- 
tion of these costs remains very difficult and is an area 
richly deserving further work. 

5 Analytic Model

Abstract models such as that in Patterson et al. [42] and 
Chen et al. [9] are useful for reasoning about the reliabil-
ity of different replicated storage system designs. In this
section, we build upon these models to further incorpo- 
rate the effect of latent and correlated faults on the over- 
all reliability of an arbitrary unit of replicated data. Our 
model is agnostic to the unit of replication; it can be a bit, 
a sector, a file, a disk or an entire storage site. We there- 
fore attempt to develop a more abstract model that can 
be interpreted in a more general, holistic fashion. While 
still coarse-grained, our model is nonetheless helpful for 
reasoning about the relative impact of a broader range of 
faults, their detection times, their repair times, and their 
correlation. The model helps point out what strategies 
are most likely to increase reliability, and what data we 
need to measure in real systems to resolve tradeoffs be- 
tween these strategies. 

We start with a simple, abstract definition of latent 
faults. We then derive the mean-time-to-failure of mir- 
rored data, in the face of both immediately visible and 
latent faults. We extend this equation to include the effect 
of correlated faults. Finally, we discuss the implications 
of this equation for long-term storage.

Figure 1: Types of replica faults. Time flows from left to right.
At the top, when a visible fault (sad face) is detected, recovery
begins immediately. At the end of successful recovery, the fault
has been corrected (smiley face). At the bottom, when a latent
fault occurs (sad face in sunglasses), nothing happens until the
fault is detected. Once the fault is detected, as with visible
faults, recovery takes place.

5.1 Fault types 

We assume that there are two types of faults: immedi- 
ately visible and latent, as shown in Figure 1. Visible 
faults are those for which the time between their occur- 
rence and detection is negligible. Causes of such faults 
include entire-disk or controller errors. We denote the 
mean time to a visible fault by MV and the associated 
mean time to repair by MRV. Latent faults are those 
for which the time between occurrence and detection is 
significant. Examples include misdirected writes, bit rot, 
unreadable sectors, and data stored in obsolete formats. 
We denote the mean time to a latent fault by ML and 
mean time to repair by MRL. We only consider la- 
tent faults that are detectable; hence, they have a finite 
mean time between occurrence and detection, denoted 
by MDL. 

5.2 Assumptions 

Our model is based on several assumptions. We build 
our model starting from the simplest assumptions and in- 
crease its complexity as needed. To start, with no addi- 
tional information, the simplest assumption we can make 
(as do others [42]) regarding the processes that generate 
faults (latent or visible) is that they are memoryless. That 
is, the probability, P(t), of a fault occurring within time
t, is independent of the past. This assumption leads to
the following exponential distribution

$$P(t) = 1 - e^{-t/\mathrm{MTTF}} \tag{1}$$

where MTTF is the mean time to the fault. For many
parts of our derivation, we consider the case where t ≪
MTTF, so the following approximation holds

$$P(t) = 1 - e^{-t/\mathrm{MTTF}} \approx \frac{t}{\mathrm{MTTF}} \tag{2}$$



This approximation and similar ones below are used only
to simplify the expression for the exponential in the prob-
ability, similar to Patterson et al. [43], so they are not funda-
mental to the model.
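
To get a feel for when the linear approximation is safe, the following sketch compares the exact exponential probability with t/MTTF. The function names and the sample values are illustrative assumptions, not figures from the paper.

```python
import math


def fault_probability(t: float, mttf: float) -> float:
    """Exact probability of at least one fault within time t (memoryless model)."""
    return 1.0 - math.exp(-t / mttf)


def fault_probability_linear(t: float, mttf: float) -> float:
    """First-order approximation t / MTTF, intended for t much smaller than MTTF."""
    return t / mttf


def relative_error(t: float, mttf: float) -> float:
    """Relative overestimate of the linear approximation versus the exact value."""
    exact = fault_probability(t, mttf)
    return (fault_probability_linear(t, mttf) - exact) / exact


if __name__ == "__main__":
    # Hypothetical MTTF of 1000 time units; t ranges from tiny to equal to MTTF.
    for t in (1.0, 10.0, 100.0, 1000.0):
        print(f"t={t:7.1f}  exact={fault_probability(t, 1000.0):.6f}  "
              f"linear={fault_probability_linear(t, 1000.0):.6f}  "
              f"rel_err={relative_error(t, 1000.0):.4f}")
```

The linear form always overestimates the exact probability, and the gap grows as t approaches MTTF, which is why the derivation restricts itself to short windows.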

Initially, we also assume that all faults occur indepen- 
dently of one another. Subsequently, we revise this as- 
sumption by introducing correlated errors that are also 
exponentially distributed but with an increased rate of 
occurrence. For simplicity, we model this increased rate 
via a multiplicative correlation factor, which is assumed 
to be the same for both latent and visible faults. 

Furthermore, undetectable faults might also exist, but 
we ignore them for this analysis. Their effect on the over- 
all reliability will dominate only if their rate of occur- 
rence is significant. In such a case, we must turn them 
into detectable faults, by developing a detection mech- 
anism for them, and correct them, by employing redun- 
dancy [27]. Otherwise, they will remain the main vulner- 
ability for the stored data. 

5.3 Reliability 

Mirrored data become irrecoverable when there are two 
successive faults such that the copy fails before the ini- 
tial fault can be repaired, an event we call a double-fault. 
Since a double-fault leads to data loss for mirrored repli- 
cas, for most of this section, 1/MTTDL is equal to the rate
of double-fault failures, where MTTDL is the mean time 
to data loss. In this section, we derive an expression for 
this quantity, which represents the reliability of mirrored 
data, to understand how it is affected by visible, latent, 
and correlated faults. Note that the double-fault rate is 
meaningful regardless of whether the faults causing those 
failures are detected or even detectable. Thus, through- 
out this section, our reliability analysis is from the per- 
spective of the data rather than the perspective of the user 
for whom errors can go unnoticed. 

To estimate MTTDL, we first need to estimate the 
probability that a second fault will occur while the first 
fault is still unrepaired. We refer to this unrepaired pe- 
riod as the window of vulnerability (WOV). Since there 
are two types of faults, we need to consider the window 
of vulnerability after each type, as illustrated in Figure 2. 

First, consider the WOV after a visible fault, V1, which on
average is MRV. During this WOV, both latent and visible faults
can occur. The probability that another visible fault, V2,
occurs is

$$P(V_2 \mid V_1) = \frac{MRV}{MV} \tag{3}$$

where $MRV \ll MV$. We obtain this result by using the
approximation in eqn 2.
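
As a quick numerical check of this linearization, the following is a minimal sketch. It assumes that eqn 2 is the usual first-order approximation for exponentially distributed fault times, 1 - exp(-t/M) ≈ t/M when t is much smaller than M. The function names are ours, and the values of MRV and MV are the illustrative ones used in Section 5.4.

```python
import math

def p_fault_in_window_exact(window_hours, mean_time_hours):
    """Probability that an exponentially distributed fault falls inside the window."""
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-window_hours / mean_time_hours)

def p_fault_in_window_linear(window_hours, mean_time_hours):
    """First-order approximation; only valid when window << mean time."""
    return window_hours / mean_time_hours

MRV = 20 / 60   # mean time to repair a visible fault (hours), i.e. 20 minutes
MV = 1.4e6      # mean time to a visible fault (hours)

exact = p_fault_in_window_exact(MRV, MV)
approx = p_fault_in_window_linear(MRV, MV)
relative_error = abs(exact - approx) / exact
print(f"exact={exact:.6e} linear={approx:.6e} relative error={relative_error:.1e}")
```

When the window approaches the mean time to fault, the linear form overshoots (it reaches 1 where the exact probability is about 0.63), which is why the later cases with long windows of vulnerability are treated separately.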

The probability that another latent fault, L2, occurs is

$$P(L_2 \mid V_1) = \frac{MRV}{ML} \tag{4}$$

where $MRV \ll ML$. The difference between $P(V_2 \mid V_1)$
and $P(L_2 \mid V_1)$ arises only from the different rates of
fault occurrence.

Figure 2: Combinations of double faults resulting in data loss.
The y axis indicates the type of the first fault (first sad face)
and the x axis indicates the type of the second fault (second
sad face). After the first fault occurs, there is a window of
vulnerability during which the occurrence of a second fault will
lead to data loss. After visible faults, this window only consists
of the recovery period. After latent faults, this window also
includes the time to detect the fault. If correlated, the second
fault is more likely to occur within this window.

Next, consider the WOV after a latent fault, L1, which on
average is MRL + MDL. Again, during this WOV, both latent and
visible faults can occur, and the difference in the two
probabilities of occurrence is simply due to the different rates
of occurrence for visible and latent faults.

The probability that another visible fault, V2, occurs is

$$P(V_2 \mid L_1) = \frac{MDL + MRL}{MV} \tag{5}$$

and the probability that another latent fault, L2, occurs is

$$P(L_2 \mid L_1) = \frac{MDL + MRL}{ML} \tag{6}$$

As before, $MRL + MDL \ll MV$ and $MRL + MDL \ll ML$. Note that
if MDL becomes large, the equations do not hold and the combined
$P(V_2 \mid L_1) + P(L_2 \mid L_1) \approx P(V_2 \lor L_2 \mid L_1)$
approaches 1.

Next, we calculate the total double-fault failure rate as
follows:

$$\frac{1}{MTTDL} = \frac{P(V_2 \mid V_1) + P(L_2 \mid V_1)}{MV} + \frac{P(V_2 \mid L_1) + P(L_2 \mid L_1)}{ML} \tag{7}$$

where the first term on the right side counts the fraction of
the visible faults that result in double-failures, and the
second term counts the fraction of latent faults that result in
double-failures.

To account for correlated faults, we assume that the probability
of the second fault (conditioned on the occurrence of the first)
is also exponentially distributed, but with a faster rate
parameter. We introduce a multiplicative correlation factor
$\alpha < 1$ that reduces the mean time to the subsequent fault
once an initial fault occurs. In this case, equations 3-6 are
multiplied by $1/\alpha$. This is undoubtedly a vast
simplification of how faults correlate in practice. Modeling
correlations accurately relies on modeling a particular system
instantiation and its component interactions, and often is
considered a black art. We are at least not alone in using the
simplification [11]. An alternative would be to introduce a
distinct MTTF for correlated faults that is less than the
independent MTTF, as done for RAID by Chen et al. [9].

Combining the previous equations and accounting for correlated
faults, MTTDL becomes

$$MTTDL = \frac{\alpha \cdot ML^2 \cdot MV^2}{(MV + ML)(MRV \cdot ML + (MRL + MDL) \cdot MV)} \tag{8}$$

5.4 Implications

To understand the implications of equation 8, we investigate its
behavior at various operating ranges. We consider the cases in
which visible faults are much more frequent than latent faults
and vice versa. We also consider the case in which the WOV after
a latent fault occurs is long. We briefly discuss the
implications in each of these cases and touch upon reliability
metrics for higher levels of redundancy.

First, consider the case in which the visible fault rate
dominates the latent fault rate,
$\{MRL + MDL, MRV\} \ll MV \ll ML$. Then,

$$MTTDL \approx \frac{\alpha \cdot MV^2}{MRV} \tag{9}$$

because $MV + ML \approx ML$ and
$MRV \cdot ML \gg (MRL + MDL) \cdot MV$. In this case, the effect
of latent faults is negligible, and thus the equation
appropriately resembles the original RAID reliability model [42].

On the other hand, if the latent fault rate dominates the
visible fault rate, $\{MRL + MDL, MRV\} \ll ML \ll MV$, then

$$MTTDL \approx \frac{\alpha \cdot ML^2}{MRL + MDL} \tag{10}$$

because $MV + ML \approx MV$ and
$MRV \cdot ML \ll (MRL + MDL) \cdot MV$. Equation 10 indicates
that if latent faults are frequent, we must reduce MDL, as it can
negate the additional ML factor of reliability, a result of
replication.
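
The two limiting cases can be checked numerically against the full expression. The sketch below is illustrative only: the function names are ours, and the operating points are hypothetical values chosen solely to satisfy each regime's inequalities, not measurements.

```python
def mttdl_full(mv, ml, mrv, mrl, mdl, alpha=1.0):
    """Equation 8: mean time to data loss for mirrored data (all times in hours)."""
    return (alpha * ml**2 * mv**2) / ((mv + ml) * (mrv * ml + (mrl + mdl) * mv))

def mttdl_visible_dominated(mv, mrv, alpha=1.0):
    """Equation 9: limit when {MRL + MDL, MRV} << MV << ML."""
    return alpha * mv**2 / mrv

def mttdl_latent_dominated(ml, mrl, mdl, alpha=1.0):
    """Equation 10: limit when {MRL + MDL, MRV} << ML << MV."""
    return alpha * ml**2 / (mrl + mdl)

# Hypothetical operating points (hours); only their relative sizes matter here.
visible_case = dict(mv=1e5, ml=1e9, mrv=1.0, mrl=1.0, mdl=10.0)
latent_case = dict(mv=1e9, ml=1e5, mrv=1.0, mrl=1.0, mdl=10.0)

full_v = mttdl_full(**visible_case)
approx_v = mttdl_visible_dominated(visible_case["mv"], visible_case["mrv"])
full_l = mttdl_full(**latent_case)
approx_l = mttdl_latent_dominated(latent_case["ml"], latent_case["mrl"],
                                  latent_case["mdl"])

print(f"visible-dominated: eq 8 = {full_v:.4e} h, eq 9  = {approx_v:.4e} h")
print(f"latent-dominated:  eq 8 = {full_l:.4e} h, eq 10 = {approx_l:.4e} h")
```

In both regimes the approximation should agree with equation 8 to within a fraction of a percent for these inputs. The agreement degrades as MV and ML approach each other, which is the situation in which the full expression should be used.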

Next, consider the case in which the visible fault rate
dominates, $MRV \ll MV \ll ML$, but latent faults are frequent
enough to be non-negligible. If the window of vulnerability
after a latent fault occurs is long, then latent faults still
play a role in increasing the double-fault failure rate. The WOV
can be long because either the detection time for latent faults
is long ($MDL \approx ML$), the repair times are long
($MRL \approx ML$), or both ($MRL + MDL \approx ML$). In this
case, $P(V_2 \lor L_2 \mid L_1) \approx 1$ in equation 7,
ensuring that a single latent fault is extremely likely to lead
to a double-fault failure. As a result, the approximation

$$MTTDL \approx \frac{\alpha \cdot MV^2}{MRV + \frac{MV^2}{ML}} \tag{11}$$

holds when latent fault rates are non-negligible, i.e.,
$ML < MV^2$.

These specializations of the model point to four impor- 
tant implications. First, in equations 9 and 10, MTTDL 
varies quadratically with both MV and ML, and in par- 
ticular, with the minimum of MV and ML. Thus, we 
must consider occurrence rates for both fault types to 
improve overall system reliability. We must be careful 
not to sacrifice one for the other, which might happen in 
practice, since MV and ML can often be anti-correlated 
depending upon hardware choice and detection strategy. 

Second, equation 10 indicates that if latent faults are
frequent, it is important to reduce their detection time, and
not just their repair time. More specifically, consider a
Seagate Cheetah disk with an MV of $1.4 \times 10^6$ hours,
bandwidth of 300 MB/s, and capacity of 146 GB, leading to an MRV
of 20 minutes. Following Schwarz et al. [50], we assume that
latent faults are five times as likely as visible faults,
resulting in an ML of $2.8 \times 10^5$ hours. Without scrubbing,
we cannot justify the approximations leading to equation 10.
Therefore, applying equation 7 and substituting
$P(V_2 \lor L_2 \mid L_1) \approx 1$, we obtain
MTTDL = 32.0 years. This gives a 79.0% probability of data loss
in 50 years if we plug it into the exponential distribution.

On the other hand, if we scrub a replica 3 times a 
year, as suggested by Schwarz et al., MDL is 1460 hours
(which is half of the scrubbing period). Then, applying 
equation 10, our reliability increases significantly. With 
no correlated errors, MTTDL = 6128.7 years, which 
gives a 0.8% chance of data loss in 50 years. This implies 
that proactive searching for latent faults is appropriate. 
Other work agrees with this conclusion [23, 44, 50]. 
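
The arithmetic behind this comparison can be reproduced with a short sketch. It rests on several assumptions of ours: a 365-day year (8760 hours), MRL equal to MRV, and equation 7 with the conditional probability after a latent fault replaced by 1 in the no-scrubbing case. The variable names are ours as well.

```python
import math

HOURS_PER_YEAR = 8760.0      # assumption: 365-day year

MV = 1.4e6                   # mean time to visible fault (hours)
ML = MV / 5                  # latent faults five times as likely: 2.8e5 hours
MRV = MRL = 20 / 60          # repair times (hours); MRL = MRV is an assumption
ALPHA = 1.0                  # no correlated faults

def loss_probability(mttdl_years, horizon_years=50.0):
    """P(data loss within the horizon) for an exponential time to data loss."""
    return -math.expm1(-horizon_years / mttdl_years)

# No scrubbing: equation 7 with P(V2 or L2 | L1) replaced by 1.
rate_no_scrub = (MRV / MV + MRV / ML) / (ALPHA * MV) + 1.0 / ML
mttdl_no_scrub = 1.0 / rate_no_scrub / HOURS_PER_YEAR

# Scrubbing three times a year: MDL is half the scrubbing period (equation 10).
MDL = HOURS_PER_YEAR / 3 / 2
mttdl_scrub = ALPHA * ML**2 / (MRL + MDL) / HOURS_PER_YEAR

for label, years in [("no scrubbing", mttdl_no_scrub),
                     ("3 scrubs/year", mttdl_scrub)]:
    print(f"{label}: MTTDL = {years:.1f} years, "
          f"P(loss in 50 years) = {loss_probability(years):.1%}")
```

The results should land close to the figures quoted above. Small differences in the last digit come from how the repair time and the length of a year are rounded.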

Third, in our model, correlation is a multiplicative factor
and affects the reliability regardless of the type of fault we
have. Continuing with our example above, assume α = 0.1 as
suggested by Chen et al. [9]. Then, MTTDL = 612.9 years, which
gives a 7.8% chance of data loss in 50 years.

The above is a conservative assumption for α because this
correlation factor can vary by several orders of magnitude. To
obtain a reasonable lower bound on α, consider that the
correlated mean-time-to-second-visible-fault can be an order of
magnitude larger than the recovery time: if
$\alpha \cdot MV > 10 \cdot MRV$, then
$\alpha > 10 \cdot MRV / MV$. For example, a bug in the firmware
recovery code for a RAID controller might cause the
mean-time-to-second-fault to be not much more than the mean time
to recover. To obtain a specific lower-bound value, we assume
the same values as above for MV and MRV = MRL, resulting in
$1 > \alpha > 2 \times 10^{-6}$, which gives a range of at least
5 orders of magnitude.
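
The bound follows from simple arithmetic on the example values. A minimal sketch, with variable names of our own choosing:

```python
import math

MV = 1.4e6           # mean time to visible fault (hours), from the example above
MRV = 20 / 60        # mean time to repair a visible fault (hours)
SAFETY_FACTOR = 10   # correlated time-to-second-fault at least 10x the recovery time

alpha_lower_bound = SAFETY_FACTOR * MRV / MV
orders_of_magnitude = math.log10(1.0 / alpha_lower_bound)
print(f"alpha > {alpha_lower_bound:.1e}; the admissible range spans about "
      f"{orders_of_magnitude:.1f} orders of magnitude")
```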

Fourth, even when latent faults are infrequent, equation 11
indicates that not attempting to detect latent faults, or
relying on very lengthy recovery procedures to fix them, will
leave the system vulnerable. For example, if
$ML = 1.4 \times 10^7$ hours, MV and MRV remain the same, and
α = 0.1, then MTTDL = 159.8 years, leading to a 26.8%
probability of data loss in 50 years. In this case, because the
system is negligent about handling latent faults, the data are
more susceptible to double-fault failures from visible and
latent faults following an initial unrepaired latent fault.
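
This example can be reproduced with the sketch below. It takes equation 11 to be MTTDL ≈ α · MV² / (MRV + MV² / ML) and assumes a 365-day year; the function name is ours.

```python
import math

HOURS_PER_YEAR = 8760.0   # assumption: 365-day year

def mttdl_undetected_latent(mv, ml, mrv, alpha):
    """Equation 11: MTTDL (hours) when a latent fault almost surely leads to loss."""
    return alpha * mv**2 / (mrv + mv**2 / ml)

mttdl_hours = mttdl_undetected_latent(mv=1.4e6, ml=1.4e7, mrv=20 / 60, alpha=0.1)
mttdl_years = mttdl_hours / HOURS_PER_YEAR
p_loss_50 = -math.expm1(-50.0 / mttdl_years)   # exponential time to data loss
print(f"MTTDL = {mttdl_years:.1f} years, P(loss in 50 years) = {p_loss_50:.1%}")
```

With these inputs the MV² / ML term dwarfs MRV, so MTTDL is close to α · ML: the reliability is set almost entirely by how often latent faults occur, not by how quickly visible faults are repaired.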

5.5 Replication and correlated faults 

In this section, we show that additional replication does 
not offer much additional reliability without indepen- 
dence. To simplify our reliability analysis for higher 
degrees of replication, we assume that we have instru- 
mented the system so that MDL is negligible, and we 
assume that latent and visible faults have similar rates 
and repair times. We roughly estimate the MTTDL of a system
with a degree of replication, r, by extending the analysis along
the same lines as for mirrored data. We calculate the
probability that $r - 1$ successive, compounding faults after an
initial fault will leave the system with no integral copy from
which to recover.

To do so, we calculate the probability of $r - 1$ successive
faults occurring within the vulnerability windows of each
previous fault. For simplicity, this analysis assumes that the
vulnerability windows, each of length MRV, overlap exactly. In
that case, the probability that the k-th copy fails within the
WOV of the previous $(k - 1)$ failed copies is roughly
$\frac{MRV}{\alpha \cdot MV}$. Thus, the probability that
successive $r - 1$ copies all fail within the WOV of previously
failed copies is the product of those $r - 1$ probabilities,
$\left(\frac{MRV}{\alpha \cdot MV}\right)^{r-1}$. Since the first
fault occurs with rate $\frac{1}{MV}$, the overall
mean-time-to-data-loss is

$$MTTDL = MV \cdot \left(\frac{\alpha \cdot MV}{MRV}\right)^{r-1} = \frac{\alpha^{r-1} \cdot MV^r}{MRV^{r-1}} \tag{12}$$

This equation shows that although increasing the level of
replication, r, geometrically increases MTTDL, a high degree of
correlated errors ($\alpha \ll 1$) would also geometrically
decrease MTTDL, thereby offsetting much or all of the gains from
additional replicas.
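
To make the interplay between r and α concrete, the sketch below evaluates equation 12 in the form MTTDL = MV · (α · MV / MRV)^(r-1). The function name is ours, MV and MRV are the example values from Section 5.4, and the three correlation levels are hypothetical.

```python
def mttdl_replicated(mv, mrv, alpha, r):
    """Equation 12: MTTDL (hours) for r replicas with exactly overlapping windows."""
    return mv * (alpha * mv / mrv) ** (r - 1)

MV = 1.4e6      # mean time to visible fault (hours)
MRV = 20 / 60   # mean time to repair (hours)

for alpha in (1.0, 1e-3, 1e-6):   # hypothetical correlation levels
    mttdl_by_r = {r: mttdl_replicated(MV, MRV, alpha, r) for r in (2, 3, 4)}
    gain = mttdl_by_r[3] / mttdl_by_r[2]   # factor bought by one extra replica
    summary = ", ".join(f"r={r}: {hours:.2e} h" for r, hours in mttdl_by_r.items())
    print(f"alpha={alpha:g} -> {summary}; gain per extra replica = {gain:.3g}")
```

Each extra replica multiplies MTTDL by α · MV / MRV. With independent replicas that factor is in the millions for these inputs, whereas at the hypothetical α = 10^-6 it falls to roughly 4.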

6 Strategies 

This simple model reveals a number of strategies for re- 
ducing the probability of irrecoverable data loss. While 
we generally describe them in terms of the more familiar
hardware and media faults, they are also applicable to other
kinds of faults. For instance, in addition to detecting faults
due to media errors, scrubbing can detect
corruption and data loss due to attack. As another exam- 
ple, we can use a similar process of cycling through the 
data, albeit at a reduced frequency, to detect data in en- 
dangered formats and convert to new formats before we 
can no longer interpret the old formats. 

• Increase MV by, for example, using storage media 
less subject to catastrophic data loss such as disk 
head crashes. 

• Increase ML by, for example, using storage media 
less subject to data corruption, or formats less sub- 
ject to obsolescence. 

• Reduce MDL by, for example, auditing the data 
more frequently to detect latent data faults, as in 
RAID scrubbing. 

• Reduce MRL by, for example, automatically repair- 
ing latent data faults rather than alerting an operator 
to do so. 

• Reduce MRV by, for example, providing hot spare 
drives so that recovery can start immediately rather 
than once an operator has replaced a drive. 

• Increase the number of replicas enough to survive 
more simultaneous faults. 

• Increase α by increasing the independence of the
replicas.

In the rest of this section we examine the practical- 
ity and costs of some techniques for implementing these 
strategies. We provide examples of systems using these 
techniques. 

6.1 Increase MV or ML 

Based on Seagate's specifications, a 200GB consumer Barracuda
drive has a 7% visible fault probability in a 5-year service
life [51], whereas a 146GB enterprise Cheetah has a 3% fault
probability [52]. But the Cheetah costs about 14 times as much
per byte.² The Barracuda has a quoted irrecoverable bit error
rate of $10^{-14}$ and the Cheetah of $10^{-15}$. Even if the
drives spend their 5-year life 99% idle, the Barracuda will
suffer about 8 and the Cheetah about 6 irrecoverable bit errors.
This 14-fold increase in cost between consumer and enterprise
disk drives yields approximately half the probability of
in-service fault and about 3/4 the probability of irrecoverable
bit fault. Thus, for long-term storage applications whose
requirements for latency and individual disk bandwidth are
minimal, the large incremental cost of enterprise drives is hard
to justify compared to the smaller incremental cost of more
(sufficiently independent) replicas on consumer drives.

² $8.20/GB versus $0.57/GB. Prices from TigerDirect.com 6/13/05.
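
The service-life fault probabilities can be translated into the MV term of the model. The sketch below assumes an exponential fault model and a 365-day year; the function name is ours, and the prices are the ones quoted in footnote 2.

```python
import math

HOURS_PER_YEAR = 8760.0                  # assumption: 365-day year
SERVICE_LIFE_HOURS = 5 * HOURS_PER_YEAR

def mttf_from_fault_probability(p_fault, period_hours):
    """Invert p = 1 - exp(-t / MTTF) for an exponential fault model."""
    return -period_hours / math.log1p(-p_fault)

mv_consumer = mttf_from_fault_probability(0.07, SERVICE_LIFE_HOURS)
mv_enterprise = mttf_from_fault_probability(0.03, SERVICE_LIFE_HOURS)
cost_ratio = 8.20 / 0.57                 # enterprise $/GB over consumer $/GB

print(f"consumer MV   ~ {mv_consumer:.2e} hours")
print(f"enterprise MV ~ {mv_enterprise:.2e} hours")
print(f"MV improves ~{mv_enterprise / mv_consumer:.1f}x "
      f"for ~{cost_ratio:.1f}x the cost per byte")
```

Under these assumptions the enterprise figure comes out close to the $1.4 \times 10^6$ hours used for the Cheetah in Section 5.4, while the improvement in MV is far smaller than the increase in cost per byte.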

6.2 Reduce MDL 

A less expensive approach to addressing latent faults due 
to media errors is to detect the faults as soon as possible 
and repair them. The only way to detect these faults is to 
audit the replicas by reading the data and either comput- 
ing checksums or comparing against other replicas. 

Assuming (unrealistically) that the detection process is 
perfect and the latent faults occur randomly, MDL will 
be half the interval between audits, so the way to reduce 
it is to audit more frequently. Another way to put this 
is that one can reduce MDL by devoting more disk read 
bandwidth to auditing and less to reading the data; re- 
cent work suggests that in many systems we can achieve 
a reasonable balance of auditing versus normal system 
usage [44, 50]. 
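
The half-interval rule can be illustrated with a small sketch. It shares the idealized assumptions stated above (perfect detection, faults spread evenly over the audit interval) and additionally assumes a 365-day year; the function name is ours.

```python
def mean_detection_latency(audit_interval_hours, samples=10_000):
    """Average wait until the next audit for faults spread evenly over one interval."""
    total = 0.0
    for i in range(samples):
        # Midpoint sampling of fault times across the interval.
        fault_time = (i + 0.5) / samples * audit_interval_hours
        # The fault stays latent until the audit at the end of the interval.
        total += audit_interval_hours - fault_time
    return total / samples

HOURS_PER_YEAR = 8760.0   # assumption: 365-day year
for audits_per_year in (1, 3, 12, 52):
    interval = HOURS_PER_YEAR / audits_per_year
    mdl = mean_detection_latency(interval)
    print(f"{audits_per_year:>2} audits/year -> MDL ~ {mdl:.0f} hours")
```

Three audits per year gives an MDL of about 1460 hours, the value used in the scrubbing example of Section 5.4.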

On-line replicas such as disk copies have significant 
advantages over off-line copies such as tape backups, for 
two reasons. First, the cost of auditing an off-line copy 
includes the cost of retrieving it from storage, mounting 
it in a reader, dismounting it and returning it to storage. 
This can be considerable, especially if the off-line copy 
is in secure off-site storage. Second, on-line media are 
designed to be accessed frequently and involve no error- 
prone human handling. They can thus be audited with- 
out the audit process itself being as significant a cause of 
faults (up to some limit, determined in part by the system 
strategy for powering components on and off [50]). Au- 
diting off-line copies, on the other hand, is a significant 
cause of highly correlated faults, from the error-prone 
human handling of media [46] to the media degradation 
caused by the reading process [3]. 

The audit strategy is particularly important in the case 
of digital preservation systems, where the probability 
that an individual data item will ever be accessed by a 
user during a disk lifetime is vanishingly small. The sys- 
tem cannot depend on user access to trigger fault detec- 
tion and recovery, because during the long time between 
accesses latent faults will build up enough to swamp re- 
covery mechanisms. A system must therefore aggres- 
sively audit its replicas to minimize MDL. The LOCKSS 
system is an example of such a system. 

Note that relying on off-line replicas for security is not 
fool-proof. Off-line storage may reduce the chances of 
some attacks, but it may still be vulnerable to insider at- 
tacks. Because it is harder to audit, the damage due to 
such attacks may persist for longer.

6.3 Reduce MRL or MRV

Although the mean time to repair a latent media fault will 
normally be far less than the mean time to detect it, and 
similarly the mean time to repair a visible fault will be far 
less than the mean time to its occurrence, reducing the 
mean time to repair is nevertheless important in reduc- 
ing the window during which the system is vulnerable to 
correlated faults. 

Again in this case, on-line replicas have the major ad- 
vantage that repair times for media faults might be very 
short indeed, a few media access times. No human in- 
tervention is needed, and the process of repair is in itself 
less likely to cause additional correlated faults. Repair- 
ing from off-line media incurs the same high costs, long 
delays and potential correlated faults as auditing off-line
media. 

6.4 Increase replication 

Off-line media are the most common approach to increasing
replication. But the processes of auditing and
recovering from faults using off-line backup copies can 
be slow, expensive, and error-prone. 

Some options for disk-based replication strategies in- 
clude replication within RAID systems, across RAID 
systems, and across simple mirrored replicas. Replica- 
tion within RAID systems does not provide geographi- 
cal or administrative independence of the replicas. If we 
opt for geographic and administrative independence of 
replicas, it might be that the extra single-site reliability 
provided by RAIDs is not worth the extra cost from a 
system-wide perspective. Further, given the cost dispar- 
ity between enterprise-grade drives and consumer-grade 
drives (see § 1), adding more simple mirrored replicas in 
non-RAID configurations might well be a cost-effective 
approach to increasing replication and thus overall reliability.
OceanStore [26] is an example of using a large
number of replicas on cheap disks. 

6.5 Increase independence 

In just the six months of the Talagala study [53], many 
correlated faults were observed, apparently caused by 
disks sharing power, cooling, and SCSI controllers, and 
systems sharing network resources. Our model suggests 
that in most cases, even with far lower rates of correlated 
faults, increasing the independence of replicas is critical 
to increasing the reliability of long-term storage. 

Long-term storage systems can reduce the probability 
of correlated faults by striving for as much diversity as 
possible in hardware, software, geographic location, and 
administration, and by avoiding dependence on third- 
party components and single organizations. Examples 
include: 

• Hardware: Disks in an array often come from a single
manufacturing batch. They thus have the same firmware, same
hardware and are the same age, and so are at the same point in
the "bathtub" lifetime failure curve [15]. However, the
increased cost that would be incurred by giving up supply chain
efficiencies of bulk purchase might make hardware diversity
difficult. Note, though, that replacing all components of a
large archival system at once is likely to be impossible. If
new storage is added in "rolling procurements" over time [7],
then differences in storage technologies and vendors over time
naturally provide hardware heterogeneity.

• Software: Systems with the same software are vulnerable to
epidemic failure; Junqueira et al. have studied how the natural
diversity of systems on a campus can be used to reduce this
vulnerability [22]. However, the increased costs caused by
encouraging such diversity, in terms not merely of purchasing,
but also of training and administration, might again make this a
difficult option for some organizations. The British Library's
system [7] is unusual in explicitly planning to develop
diversity of both hardware and software over time. Nevertheless,
given the speed with which malware can find all networked
systems sharing a vulnerability, increasing the diversity of
both platform and application software is an effective strategy
for increasing α.

• Geographic location: Many systems using off-line backup
store replicas off-site, despite the additional storage and
handling charges that implies. Digital preservation systems,
such as the British Library's, establish each of their on-line
replicas in a different location, again despite the possible
increased operational costs of doing so.

• Administration: Human error is a common cause of 
correlated faults among replicas. Again, the British 
Library's system is unusual in ensuring that no sin- 
gle administrator will be able to affect more than one 
replica. This is probably more effective and more 
cost-effective than attempts to implement "dual-key" 
administration, in which more than one administra- 
tor has to approve each potentially dangerous action. 
In a crisis, shared pre-conceptions are likely to cause 
both operators to make the same mistake [45]. 

• Components: System designs should avoid depen- 
dence on third-party components that might not 
themselves be preserved over time. Determining all 
sources of such dependence can be tricky, but some 
sources can be detected by running systems in isola- 
tion to see what breaks. For example, running a sys- 
tem in a network without a domain name service or 
certificate authority can determine whether the sys- 
tem is dependent on those services. LOCKSS is an 
example of a system built to be independent of the 
survival of these services.

• Organization: Taking an end-to-end view of preservation
systems, it is also important to support their
organizational independence. For instance, if the im- 
portance of a collection extends beyond its current 
organization, then there must be an easy and cost- 
effective "exit strategy" for the collection if the or- 
ganization ceases to exist. As an example, bickering 
couples might not want the survival of their babies' 
photos to depend on a home IT system whose main- 
tenance requires their continued cohabitation. 

In summary, the main techniques for increasing the reliability
of long-term storage are replication, independence of replicas,
auditing replicas to detect latent faults, and automated
recovery to reduce repair times and costs.

6.6 Tradeoffs 

Unfortunately, these strategies are not necessarily orthogonal,
and some can have adverse effects on reliability. Here we
consider the effects of auditing and automated recovery. There
are many other possible tradeoffs, such as the costs of
increased independence of administrative domains and diversity
of hardware and software, that we do not cover here.

Auditing is necessary for detecting latent faults, but as 
previously described, it increases the frequency of me- 
dia access, which might increase both visible and la- 
tent media errors and costs due to increased consump- 
tion of power and other system and administrative re- 
sources. Previous work [23, 44, 50] suggests it is pos- 
sible to achieve an appropriate balance that increases re- 
liability considerably while performing much of the audit 
work in background, even opportunistically when legiti- 
mate data accesses require powering on the correspond- 
ing system components. 

Unfortunately, the audit process can itself introduce 
other channels for data corruption. An example would be 
attacking a distributed system through the audit protocol 
itself [33], which therefore must be designed as carefully 
as any other distributed protocol. 

While automated recovery can reduce costs and speed 
up recovery times, if buggy or compromised by an at- 
tacker, it can itself introduce latent faults. This can be 
dangerous because even visible faults can now (though 
seemingly having been recovered) turn into latent ones. 

6.7 Data gathering 

Our simple model of reliability of replicated storage 
points out areas where we are in great need of further 
data to validate the model and to evaluate the potential 
utility of the reliability strategies described in this paper. 
In particular, there is very little published about the types 
and distribution of latent faults, both due to media errors 
and also due to the other threats we describe. Moreover, 
the correlations that result in latent faults are poorly
understood.

One desirable application of the model would be choosing, for
instance, between two levels of replication and audit. Assume
that for disaster tolerance we have two geographically
independent replica systems. Would it be better for each system
to audit its storage internally? Or would it be better to audit
between the two replicas? Answering such a question requires
understanding, at a minimum: the MTTF of both visible and latent
faults, the MDL of different audit strategies, the recovery
strategies, the costs of replicating information internally and
geographically, and the costs of auditing internally versus
between sites.

To gather this information we can instrument existing 
systems. For a start, we can log occurrences of visible 
faults, detection of latent faults, and occurrences of data 
loss. One approach for detecting latent faults is to cycle
proactively through the storage, logging checksums
of immutable objects. Ideally, these fault occurrences
would include timestamps and vertical information about 
the location of the fault: block, sector, disk, file, appli- 
cation, etc. We would use such information to determine 
the distributions and rates of the faults that we consider, 
thereby testing the validity of our assumptions. Addi- 
tionally, we can log information about recovery proce- 
dures performed (e.g., replacement of disks or recovery 
from tape), their duration, and outcomes. We could use 
such data to measure mean recovery times and, combined 
with the previous information, validate the model itself. 
We could employ metadata about the configuration: me- 
dia, hardware, software, etc., to estimate costs. Finally, 
we could log SMART data from disks, and external in- 
formation such as application workloads offered, pro- 
cesses spawned, file system and network statistics, and 
administrator changes. We could mine such information 
to identify and perform root cause analysis for correlated 
faults. 
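
The checksum-logging approach mentioned above could be instrumented along the following lines. This is a minimal sketch under our own assumptions, not a description of any existing system: the function names, the in-memory manifest, the JSON-lines log format, and the choice of SHA-256 are all illustrative. It trusts the first digest it records for each object, and it expects the log file to live outside the audited directory.

```python
import hashlib
import json
import time
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large objects need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def audit_pass(root, manifest, log_path):
    """Check every object under root against its recorded digest; log faults.

    manifest maps a path relative to root to the expected hex digest. Objects
    seen for the first time are added to it; altered or missing objects are
    appended to the log with a timestamp and returned to the caller.
    """
    root = Path(root)
    events = []
    seen = set()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        name = path.relative_to(root).as_posix()
        seen.add(name)
        actual = sha256_of(path)
        expected = manifest.setdefault(name, actual)
        if actual != expected:
            events.append({"time": time.time(), "object": name,
                           "fault": "checksum_mismatch"})
    for name in sorted(set(manifest) - seen):
        events.append({"time": time.time(), "object": name, "fault": "missing"})
    with open(log_path, "a", encoding="utf-8") as log:
        for event in events:
            log.write(json.dumps(event) + "\n")
    return events
```

Running `audit_pass` on a schedule and recording when each pass starts would yield the detection timestamps needed to estimate ML and MDL. The vertical location information discussed above (block, sector, disk) would have to come from lower layers that this sketch does not touch.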

Efforts are underway to gather some of this informa- 
tion for existing systems. For instance, several groups at 
UC Santa Cruz, HP Labs and elsewhere have been pro- 
cessing just such failure data from large archives such as 
the Internet Archive. We need information, though, from 
many different kinds of storage systems with different 
replication architectures. 

7 Related Work 

In this section we review related work, showing how oth- 
ers address problems, such as correlated and latent faults, 
that arise when large amounts of data must remain unal- 
tered yet accessible with low latency at low cost. We 
start with low-level approaches that focus on single de- 
vices and RAID arrays, then move up the stack. 

Evidence of correlated faults comes from studies of 
disk farms. Talagala [53] logs every fault in a large disk
farm (of 368 disk drives) at UC Berkeley over six months
and shows significant correlations between them. The 
study focuses primarily on media (drive failures), power 
failures, and system software/dependency failures. 

Chen et al. [9] explore the tradeoffs between performance and
reliability in RAID systems, noting that system crashes (i.e.,
correlated faults) and uncorrectable bit errors (i.e., latent
faults) can greatly reduce the reliability predicted in the
original RAID paper [42]. Kari's dissertation [23] appears to be
the first comprehensive analysis of latent disk failures. His
model also shows that they can greatly reduce the reliability of
RAID systems, and presents four scrubbing algorithms that adapt
to disk activity, using idle time to flush out latent errors. In
contrast, we explore a broader space that includes application
faults and distributed replication.

Enterprise storage systems have recognized the need to address
latent and correlated faults. Network Appliance's storage threat
model includes two whole-disk failures, and a whole-disk failure
with a latent fault discovered during recovery (our
$P(L_2 \mid V_1)$). They employ row-diagonal parity and suggest
periodic scrubbing of disks to improve reliability [11]. Schwarz
et al. [50] show that opportunistic scrubbing, piggy-backed on
other disk activity, performs very well. Like us, they do not
depend on the disk to detect a latent fault but actually check
the data. Our exploration also includes higher-layer failures.

Database vendors have implemented some 
application-level techniques to detect corruption. 
DB2's threat model includes a failure to write a database 
page spanning multiple sectors atomically. It sprinkles 
page consistency bits through each page, modifying and 
checking them on each read and write. These bits detect 
some other forms of corruption but only those affecting 
the consistency bits [36]. Tandem NonStop systems 
write checksums to disk with the data, and compare 
them when data are read back [8]. 

The IRON File System [44] is a file system whose 
threat model includes latent faults and silent corruption 
of the disk. It protects the file system's metadata (but not 
the data) using checksums and in-disk replication. 

An alternative to tightly-coupled replication such as 
RAID is more loosely-coupled distributed replication.
In 1990 Saltzer suggested that digital archives need ge- 
ographically distributed replicas to survive natural dis- 
asters, that they must proactively migrate to new media 
to survive media obsolescence, and that heavy-duty for- 
ward error correction could correct corruption that can 
accumulate when data are rarely accessed [49]. More 
recently, these ideas inform the design of the British Li- 
brary's digital archive [7]. 

Many distributed peer-to-peer storage architectures have been
proposed to provide highly-available and persistent storage
services, including the Eternity Service [5], Intermemory [10],
CFS [12], OceanStore [26], PAST [48], and Tangler [57]. Their
threat models vary, but include powerful adversaries (Eternity
Service) and multiple failures. Some (OceanStore) use
cryptographic sharing to proliferate n partial replicas, from
any m < n of which they can recover the data. Others (PAST)
replicate whole files. Weatherspoon's [58] model compares the
reliability of these approaches. While the model itself does not
include latent errors, correlated errors, or others such as
operational or human errors, it takes into account the storage
and bandwidth requirements of each approach. Later work [59]
identifies correlations among replicas (e.g., geographic or
administrative) and informs the replication policy to reduce
correlation effects.

Deep Store [61] and the LOCKSS system [33] share the belief
that preserving large amounts of data for long periods
affordably is a major design challenge. Deep Store addresses the
cost issues by eliminating redundancy, LOCKSS by using a
"network appliance" approach to reducing system administration
[47], and large numbers of loosely coupled replicas on low-cost
hardware. Both recognize the threats of "bit rot" and format
obsolescence to long-term preservation.

8 Conclusions and Future Work 

In this paper we motivate the need for long-term storage 
of digital information and examine the threats to such 
information. Using an extended reliability model that
incorporates latent faults, correlated faults, and the detection
time of latent faults, we reason about possible strategies for
improving long-term reliability of these systems. The cost of
these strategies is important, since limited budget is one of
the key threats to digital preservation. We find that the most
important strategies are auditing to detect latent faults as
soon as possible, automating repair so that it is as fast,
cheap, and reliable as possible, and increasing the
independence of data replicas.

This is clearly a work in progress. We do not yet 
have data to characterize all the terms in our model. The 
model thus points to several data collection projects that 
would be very useful. For instance, we need more data 
on the mean time to different types of latent faults, the 
repair times for faults once detected, and the levels of 
correlation of different kinds of faults. We are currently 
seeking out sources of these data and instrumenting sys- 
tems to gather more. 

A digital preservation system should rarely suffer in- 
cidents that could cause data loss. Thus the total expe- 
rience base available to designers of such systems will 
grow very slowly with time, making it difficult to iden- 
tify and fix the problems that will undoubtedly arise. Past 
incidents indicate that it will in practice be very diffi- 
cult to accumulate this experience base. The host or- 
ganization's typical response to an incident causing data
loss is to cover it up. A few details become known via
the grapevine, but the true story never becomes part of 
the experience base. A system similar to NASA's Avia- 
tion Safety Reporting System [37] should be established, 
through which operators of storage systems could submit 
reports of incidents, even those not resulting in data loss, 
for others to read in anonymized form and from which 
they can learn how to improve the reliability of their own 
systems.